ABSTRACT



A computer based system and method of retrieving infor- 
mation pertaining to electronic documents on a computer 
network is disclosed. The method includes maintaining a 
database that associates each electronic document with a 
corresponding crawl number that indicates the most recent 
crawl during which a change to the document was detected. 
During a subsequent crawl, electronic documents that have 
changed since the previous crawl are retrieved, and selected 
data is stored in a database. The retrieved document infor- 
mation is marked with a crawl number. During subsequent 
searches, crawl numbers are used to determine documents 
that have changed since a specified crawl. 

29 Claims, 13 Drawing Sheets 
METHOD OF WEB CRAWLING UTILIZING CRAWL NUMBERS

FIELD OF THE INVENTION

The present invention relates to the field of network information
software and, in particular, to methods and systems for retrieving
data from network sites.

BACKGROUND OF THE INVENTION

In recent years, there has been a tremendous proliferation of
computers connected to a global network known as the Internet. A
"client" computer connected to the Internet can download digital
information from "server" computers connected to the Internet.
Client application software executing on client computers typically
accepts commands from a user and obtains data and services by
sending requests to server applications running on server computers
connected to the Internet. A number of protocols are used to
exchange commands and data between computers connected to the
Internet. The protocols include the File Transfer Protocol (FTP),
the Hyper Text Transfer Protocol (HTTP), the Simple Mail Transfer
Protocol (SMTP), and the "Gopher" document protocol.

The HTTP protocol is used to access data on the World Wide Web,
often referred to as "the Web." The World Wide Web is an information
service on the Internet providing documents and links between
documents. The World Wide Web is made up of numerous Web sites
located around the world that maintain and distribute electronic
documents. A Web site may use one or more Web server computers that
store and distribute documents in one of a number of formats
including the Hyper Text Markup Language (HTML). An HTML document
contains text and metadata or commands providing formatting
information. HTML documents also include embedded "links" that
reference other data or documents located on any Web server
computers. The referenced documents may represent text, graphics,
or video in respective formats.

A Web browser is a client application or operating system utility
that communicates with server computers via FTP, HTTP, and Gopher
protocols. Web browsers receive electronic documents from the
network and present them to a user. Internet Explorer, available
from Microsoft Corporation, of Redmond, Washington, is an example of
a popular Web browser application.

An intranet is a local area network containing Web servers and
client computers operating in a manner similar to the World Wide Web
described above. Typically, all of the computers on an intranet are
contained within a company or organization.

Web crawlers are computer programs that retrieve numerous electronic
documents from one or more Web sites. A Web crawler processes the
received data, preparing the data to be subsequently processed by
other programs. For example, a Web crawler may use the retrieved
data to create an index of documents available over the Internet or
an intranet. A "search engine" can later use the index to locate
electronic documents that satisfy specified criteria.

A user that performs a document search provides search parameters to
limit the number of documents retrieved. For example, a user may
submit a search request that includes a list of one or more words,
and the search engine locates electronic documents that contain a
specified combination of the words. A user may repeat a search after
a period of time. When a search is repeated, the user may prefer to
avoid locating documents that have been located by prior searches.

It is desirable to have a mechanism by which a user can request a
search engine to return only documents that have changed in some
substantive way since that prior search. Preferably, such a
mechanism will provide a Web crawler with a way to retrieve only
documents that may have changed since a previous Web crawl and then
to determine if an actual, substantive change has been made to the
document. The mechanism would also preferably provide a way to mark
the data retrieved from the document and stored in an index with an
identifier that could be used in a search of the index to indicate
when the Web crawler last found a substantive change to the
document. The present invention is directed to providing such a
mechanism.

SUMMARY OF THE INVENTION

In accordance with this invention, a system and computer based
method of retrieving data from a computer network are provided. In
an actual embodiment of the present invention, the method includes
performing a Web crawl, by retrieving a set of electronic documents
and subsequently retrieving additional electronic documents based on
addresses specified within each electronic document. In a later Web
crawl, electronic documents that have been modified subsequent to
the previous Web crawl and electronic documents that were not
retrieved during the previous Web crawl are retrieved. Electronic
documents that were deleted since the previous Web crawl are
detected. Each Web crawl is assigned a unique current crawl number.
A crawl number modified is associated with and stored with the
storage data from each electronic document retrieved during the Web
crawl. The crawl number modified is set equal to the current crawl
number when the document is first retrieved, or when it has
previously been retrieved and has been found by the mechanism of the
invention to have been modified in some substantive manner. In a
subsequent search request, a crawl number can be retained as a
search parameter and compared against a crawl number modified that
is stored with the document data to determine if a document has been
modified subsequent to the crawl number specified in the search.

In accordance with other aspects of this invention, each electronic
document has a corresponding document address specification and
provides information for locating the electronic document. During a
Web crawl, document address specifications are used to retrieve
copies of the corresponding electronic documents. Information from
each electronic document retrieved during a Web crawl is stored in
an index and associated with the corresponding document address
specification and with a crawl number modified. If the retrieved
document contains document address specifications to linked
documents included in hyperlinks, these linked documents are also
selectively retrieved during the Web crawl and processed in the
manner described above.

In accordance with further aspects of this invention, performing a
Web crawl includes assigning a unique current crawl number to the
Web crawl, and determining whether a currently retrieved electronic
document corresponding to each previously retrieved electronic
document copy is substantively equivalent to the corresponding
previously retrieved electronic document copy, in order to determine
whether the electronic document has been modified since a previous
crawl. If the current electronic document is not substantively
equivalent to the previously retrieved electronic document copy, and
therefore has been modified, the document's associated crawl number
modified is set to the current crawl number and stored in the index
with the data from the current electronic document.
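
The rule for assigning the crawl number modified can be sketched in a few lines. This is only an illustration under stated assumptions: the names `IndexEntry` and `record_document` are hypothetical, the index is modeled as an in-memory dictionary, and "substantively equivalent" is approximated here by plain string equality.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class IndexEntry:
    """Hypothetical index record; field names are illustrative only."""
    content: str
    crawl_number_modified: int


def record_document(index: Dict[str, IndexEntry], url: str,
                    content: str, current_crawl: int) -> bool:
    """Store `content` for `url`; return True if the document counts as modified."""
    previous: Optional[IndexEntry] = index.get(url)
    if previous is not None and previous.content == content:
        # Treated as substantively equivalent: keep the old crawl number modified.
        return False
    # First retrieval, or a change was detected: stamp with the current crawl.
    index[url] = IndexEntry(content, current_crawl)
    return True


index: Dict[str, IndexEntry] = {}
record_document(index, "http://example.com/a", "hello", current_crawl=1)
record_document(index, "http://example.com/b", "world", current_crawl=1)
record_document(index, "http://example.com/a", "hello", current_crawl=2)   # unchanged
record_document(index, "http://example.com/b", "world!", current_crawl=2)  # changed
```

After the second crawl, document `a` still carries crawl number modified 1, while document `b` carries 2.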

In accordance with still other aspects of this invention, a 
secure hash function is used to determine a hash value 
corresponding to each retrieved electronic document copy. 
The hash value is stored in the index and used in subsequent 
Web crawls to determine whether the corresponding elec- 
tronic document is modified. The current electronic docu- 
ment is retrieved and used to obtain a new hash value, which 
is compared with the previously determined hash value 
corresponding to the associated document address
specification that is stored in a history map. If the hash values are
equal, the current electronic document is considered to be 
substantively equivalent to the previously retrieved elec- 
tronic document copy. If the hash values differ, the current 
electronic document is considered to be modified and the 
current crawl number is associated with the newly retrieved 
electronic document as the crawl number modified. The 
crawl number modified indicates the crawl number of the 
last crawl in which the data in the document was found to 
have changed. The hash value is stored with the associated 
data from the retrieved document and stored in the index. 
Preferably, hash functions are applied to data from electronic 
documents after selected data has been filtered out, so that 
filtered out data is not represented in the hash values, and is 
therefore not considered in comparisons. For instance, for- 
matting information contained in the retrieved document 
could be filtered out before the hash value is computed. 
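
The filter-then-hash comparison might look like the following sketch. Several details are assumptions made for illustration: SHA-256 stands in for the unspecified secure hash function, the filter simply strips HTML tags and collapses whitespace, and the history map is a dictionary with hypothetical keys `hash` and `crawl_number_modified`.

```python
import hashlib
import re
from typing import Dict

_TAG = re.compile(r"<[^>]+>")   # crude tag matcher, enough for a sketch
_WHITESPACE = re.compile(r"\s+")


def filtered_text(html: str) -> str:
    """Drop markup and collapse whitespace so formatting does not affect the hash."""
    return _WHITESPACE.sub(" ", _TAG.sub(" ", html)).strip()


def content_hash(html: str) -> str:
    """Hash only the filtered content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(filtered_text(html).encode("utf-8")).hexdigest()


def update_history(history: Dict[str, dict], url: str,
                   html: str, current_crawl: int) -> bool:
    """Compare the new hash with the stored one; return True if modified."""
    new_hash = content_hash(html)
    entry = history.get(url)
    if entry is not None and entry["hash"] == new_hash:
        return False  # substantively equivalent; crawl number modified is kept
    history[url] = {"hash": new_hash, "crawl_number_modified": current_crawl}
    return True


history: Dict[str, dict] = {}
update_history(history, "http://example.com/", "<p>Hello <b>world</b></p>", 1)
# Only the formatting changed, so the document is not treated as modified.
reformatted = update_history(history, "http://example.com/", "<div>Hello world</div>", 2)
# The text itself changed, so the crawl number modified moves to 3.
edited = update_history(history, "http://example.com/", "<div>Hello there</div>", 3)
```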

In accordance with further aspects of this invention, 
during an incremental crawl, prior to retrieving an electronic 
document copy, the time stamp of the current electronic 
document is compared with the previously stored time stamp 
of a previously retrieved electronic document corresponding 
to the current electronic document. If the respective time 
stamps match, the current electronic document is considered 
to be substantively equivalent to its corresponding previ- 
ously retrieved electronic document copy, and is therefore 
not retrieved during the current incremental crawl. 
Preferably, the comparison of time stamps is performed by 
sending a request to a server to transfer the current electronic 
document if the time stamp associated with the current 
electronic document is more recent than a time stamp 
included in the request. 
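
When the access method is HTTP, a request of this kind corresponds to a conditional GET using the standard `If-Modified-Since` header, to which a server answers `304 Not Modified` when the document is not newer. The sketch below assumes HTTP and uses no network; `conditional_headers` and `server_status` are hypothetical helpers, and the second one is a toy stand-in for a real server.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Dict


def conditional_headers(stored_time_stamp: datetime) -> Dict[str, str]:
    """Build the request header carrying the time stamp from the history map."""
    return {"If-Modified-Since": format_datetime(stored_time_stamp, usegmt=True)}


def server_status(last_modified: datetime, headers: Dict[str, str]) -> int:
    """Toy server: 200 (send the document) only if it is newer, otherwise 304."""
    since = parsedate_to_datetime(headers["If-Modified-Since"])
    return 200 if last_modified > since else 304


stored = datetime(1994, 11, 6, 8, 49, 37, tzinfo=timezone.utc)
headers = conditional_headers(stored)

unchanged_status = server_status(stored, headers)                     # not retrieved
changed_status = server_status(stored + timedelta(days=1), headers)   # retrieved
```

With this arrangement the crawler transfers the document body only when the server reports a newer time stamp, which is what keeps an incremental crawl inexpensive.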

As will be readily appreciated from the foregoing 
description, a system and method formed in accordance with 
the invention for retrieving data from electronic documents 
on a computer network provide an efficient way of retrieving 
and storing information pertaining to electronic documents, 
wherein the retrieval of electronic documents that have 
previously been retrieved is minimized. The invention 
allows a Web crawler to perform crawls in less time and to 
perform more comprehensive crawls. Assigning a crawl 
number modified to a retrieved document that is set to the 
current crawl number when the document has been retrieved 
and found to have been modified in some substantive way 
since the last time it was retrieved by the invention or if it 
is the first time the document is retrieved advantageously 
reduces search and document retrieval time. 

Storing the crawl number modified with the document 
data enables a user to perform a subsequent search using a 
crawl number as a search criteria. This allows a user to 
search only for documents that have substantively changed 
since a previous search. For instance, a user could run a first 
search requesting documents that meet a particular query. 
The intermediate agent that queries the search engine could
retain the crawl number of the most recent crawl made by 
the web crawler along with recording the search query. A
second search performed at a later time could run the same
query as the first search, but with the intermediate agent 
implicitly adding the retained crawl number as a search 
criteria. The resulting search will only return documents 
with an associated crawl number modified that is subsequent 
to the retained crawl number. Because the crawl number 
modified associated with a document only changes when a 
subsequent Web crawl finds that it has changed in a sub- 
stantive way, the second search would only return docu- 
ments that have actually changed since the first search. The 
present invention offers other advantages over solely relying 
on the timestamp of the document to search for new
documents. For instance, a search that requests only documents
with a timestamp subsequent to the date of a prior search
would not return any new documents found by the Web 
crawler but having timestamps that are earlier than the date 
of the last search. 
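
The repeated search can be sketched as a filter on the crawl number modified. Everything named here is hypothetical (`Doc`, `search`, the `since_crawl` parameter, and the list standing in for the index); the point is only that the intermediate agent retains a crawl number with the query and adds it as an extra criterion on the second run.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    """Hypothetical index record for one electronic document."""
    url: str
    text: str
    crawl_number_modified: int


def search(index: List[Doc], word: str, since_crawl: int = 0) -> List[str]:
    """Return URLs containing `word` whose crawl number modified exceeds `since_crawl`."""
    return [d.url for d in index
            if word in d.text.split() and d.crawl_number_modified > since_crawl]


# State of the index after crawl 2: both documents were found in crawl 1.
index = [
    Doc("http://example.com/a", "web crawler basics", 1),
    Doc("http://example.com/b", "crawler history map", 1),
]
first_results = search(index, "crawler")
retained_crawl = 2  # the agent records the most recent crawl number with the query

# Crawl 3 finds a change in /b and a new page /c carrying an old time stamp.
index[1] = Doc("http://example.com/b", "crawler history map revised", 3)
index.append(Doc("http://example.com/c", "crawler archive", 3))
second_results = search(index, "crawler", since_crawl=retained_crawl)
```

The second search returns `/b` and `/c` but not `/a`. Page `/c` is returned even if its own time stamp predates the first search, which a purely time-stamp-based filter would miss.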

BRIEF DESCRIPTION OF THE DRAWING 

The foregoing aspects and many of the attendant advan- 
tages of this invention will become more readily appreciated 
as the same becomes better understood by reference to the
following detailed description, when taken in conjunction 
with the accompanying drawings, wherein: 

FIG. 1 is a block diagram of a general purpose computer 
system for implementing the present invention; 

FIG. 2 is a block diagram illustrating a network 
architecture, in accordance with the present invention; 

FIG. 3 is a block diagram illustrating some of the com- 
ponents used in the invention; 

FIG. 4 illustrates an exemplary history map in accordance 
with the present invention; 

FIG. 5 illustrates an exemplary transaction log in accor- 
dance with the present invention; 

FIG. 6 is a flow diagram illustrating the process of 
performing a first full crawl; 

FIG. 7 is a flow diagram illustrating the process of 
performing a Web crawl; 

FIGS. 8a and 8b are flow diagrams illustrating the
processing of URLs, in accordance with the invention;

FIG. 9 is a flow diagram illustrating the processing during 
a Web crawl of URLs that are linked in an electronic 
document; 

FIG. 10 is a flow diagram illustrating the process of 
performing a full crawl, in accordance with the invention; 

FIG. 11 is a flow diagram illustrating the process of 
performing an incremental crawl, in accordance with the 
invention; and 

FIG. 12 is a flow diagram illustrating the process of 
performing a search for electronic documents, in accordance 
with the present invention. 

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENT 

The present invention is a mechanism for obtaining 
information pertaining to electronic documents that reside 
on one or more server computers. While the following 
discussion describes an actual embodiment of the invention 
that crawls the Internet within the World Wide Web, the 
present invention is not limited to that use. This present 
invention may also be employed on any type of computer 
network or individual computer having data stores such as 
file systems, e-mail messages and databases. The
information from all of these different stores can be processed by the
invention together or separately. 
A server computer is referred to as a Web site, and the process of
locating and retrieving digital data from Web sites is referred to
as "Web crawling." The mechanism of the invention initially performs
a first full crawl wherein a transaction log is "seeded" with one or
more document address specifications. Each document listed in the
transaction log is retrieved from its Web site and processed. The
processing includes extracting the data from each of these retrieved
documents and storing that data in an index, or other database, with
an associated crawl number modified that is set equal to a unique
current crawl number that is associated with the first full crawl. A
hash value for the document and the document's time stamp are also
stored with the document data in the index. The document URL, its
hash value, its time stamp, and its crawl number modified are stored
in a persistent history map that is used by the crawler to record
the documents that it has crawled.

Subsequent to the first full crawl, the invention can perform any
number of full crawls or incremental crawls. During a full crawl,
the transaction log is "seeded" with one or more document address
specifications, which are used to retrieve the document associated
with the address specification. The retrieved documents are
recursively processed to find any "linked" document address
specifications contained in the retrieved document. The document
address specification of the linked document is added to the
transaction log the first time it is found during the current crawl.
The full crawl builds a new index based on the documents that it
retrieves based on the "seeds" in its transaction log and the
project gathering rules that constrain the search. During the course
of the full crawl, the document address specifications of the
documents that are retrieved are compared to associated entries in
the history map (if there is an entry), and a crawl number modified
is assigned as is discussed in detail below.

An incremental crawl retrieves only electronic documents that may
have changed since the previous crawl. The incremental crawl uses
the existing index and history map and its transaction log is seeded
with the document address specifications contained in the history
map. In an incremental crawl, a document is retrieved from a Web
site if its time stamp is subsequent to the time stamp stored in the
Web crawler's history map. During an incremental crawl, a document
is preferably only retrieved from a Web site if the time stamp on
the document on the Web site is different than the time stamp that
was recorded in the history map for that URL. If the time stamp
differs, the document is retrieved from the Web server.

During the Web crawls, the invention determines if an actual
substantive change has been made to the document. This is done by
filtering extraneous data from the document (e.g., formatting
information) and then computing a hash value for the retrieved
document data. This newly computed hash value is then compared
against the hash value stored in the history map. Different hash
values indicate that the content of the document has changed,
resulting in the crawl number modified stored with the document data
being reset to the current crawl number assigned to the Web crawl.

Searches of the database created by the Web crawler can use the
crawl number modified as a search parameter if a user is only
interested in documents that have changed, or that have been added,
since a previous search. Since the invention only changes the crawl
number modified associated with the document when it is first
retrieved, or when it has been retrieved and found to be modified,
the user can search for only modified documents. In response to this
request, the intermediate agent implicitly adds a limitation to the
search that the search return only documents that have a crawl
number modified that is subsequent to a stored crawl number
associated with a prior search.

Web crawler programs execute on a computer, preferably a general
purpose personal computer. FIG. 1 and the following discussion are
intended to provide a brief, general description of a suitable
computing environment in which the invention may be implemented.
Although not required, the invention will be described in the
general context of computer-executable instructions, such as program
modules, being executed by a personal computer. Generally, program
modules include routines, programs, objects, components, data
structures, etc. that perform particular tasks or implement
particular abstract data types. Moreover, those skilled in the art
will appreciate that the invention may be practiced with other
computer system configurations, including hand-held devices,
multiprocessor systems, microprocessor-based or programmable
consumer electronics, network PCs, minicomputers, mainframe
computers, and the like. The invention may also be practiced in
distributed computing environments where tasks are performed by
remote processing devices that are linked through a communications
network. In a distributed computing environment, program modules may
be located in both local and remote memory storage devices.

With reference to FIG. 1, an exemplary system for implementing the
invention includes a general purpose computing device in the form of
a conventional personal computer 20, including a processing unit 21,
a system memory 22, and a system bus 23 that couples various system
components including the system memory to the processing unit 21.
The system bus 23 may be any of several types of bus structures
including a memory bus or memory controller, a peripheral bus, and a
local bus using any of a variety of bus architectures. The system
memory includes read only memory (ROM) 24 and random access memory
(RAM) 25. A basic input/output system 26 (BIOS), containing the
basic routines that help to transfer information between elements
within the personal computer 20, such as during startup, is stored
in ROM 24. The personal computer 20 further includes a hard disk
drive 27 for reading from and writing to a hard disk, not shown, a
magnetic disk drive 28 for reading from or writing to a removable
magnetic disk 29, and an optical disk drive 30 for reading from or
writing to a removable optical disk 31 such as a CD ROM or other
optical media. The hard disk drive 27, magnetic disk drive 28, and
optical disk drive 30 are connected to the system bus 23 by a hard
disk drive interface 32, a magnetic disk drive interface 33, and an
optical drive interface 34, respectively. The drives and their
associated computer-readable media provide nonvolatile storage of
computer readable instructions, data structures, program modules and
other data for the personal computer 20. Although the exemplary
environment described herein employs a hard disk, a removable
magnetic disk 29 and a removable optical disk 31, it should be
appreciated by those skilled in the art that other types of
computer-readable media which can store data that is accessible by a
computer, such as magnetic cassettes, flash memory cards, digital
versatile disks, Bernoulli cartridges, random access memories
(RAMs), read only memories (ROM), and the like, may also be used in
the exemplary operating environment.

A number of program modules may be stored on the hard disk, magnetic
disk 29, optical disk 31, ROM 24 or RAM 25, including an operating
system 35, one or more application programs 36, other program
modules 37, and program data 38. A user may enter commands and
information into the personal computer 20 through input devices such
as a keyboard 40 and pointing device 42. Other input devices
(not shown) may include a microphone, joystick, game pad, satellite
dish, scanner, or the like. These and other input devices are often
connected to the processing unit 21 through a serial port interface
46 that is coupled to the system bus, but may be connected by other
interfaces, such as a parallel port, game port or a universal serial
bus (USB). A monitor 47 or other type of display device is also
connected to the system bus 23 via an interface, such as a video
adapter 48. One or more speakers 57 are also connected to the system
bus 23 via an interface, such as an audio adapter 56. In addition to
the monitor and speakers, personal computers typically include other
peripheral output devices (not shown), such as printers.

The personal computer 20 operates in a networked environment using
logical connections to one or more remote computers, such as remote
computers 49 and 60. Each remote computer 49 or 60 may be another
personal computer, a server, a router, a network PC, a peer device
or other common network node, and typically includes many or all of
the elements described above relative to the personal computer 20,
although only a memory storage device 50 or 61 has been illustrated
in FIG. 1. The logical connections depicted in FIG. 1 include a
local area network (LAN) 51 and a wide area network (WAN) 52. Such
networking environments are commonplace in offices, enterprise-wide
computer networks, intranets and the Internet. As depicted in
FIG. 1, the remote computer 60 communicates with the personal
computer 20 via the local area network 51. The remote computer 49
communicates with the personal computer 20 via the wide area
network 52.

When used in a LAN networking environment, the personal computer 20
is connected to the local network 51 through a network interface or
adapter 53. When used in a WAN networking environment, the personal
computer 20 typically includes a modem 54 or other means for
establishing communications over the wide area network 52, such as
the Internet. The modem 54, which may be internal or external, is
connected to the system bus 23 via the serial port interface 46. In
a networked environment, program modules depicted relative to the
personal computer 20, or portions thereof, may be stored in the
remote memory storage device. It will be appreciated that the
network connections shown are exemplary and other means of
establishing a communications link between the computers may be
used.

FIG. 2 illustrates an exemplary architecture of a networked system
in which the present invention operates. A server computer 204
includes a Web crawler program 206 executing thereon. The Web
crawler program 206 searches for electronic documents distributed on
one or more computers connected to a computer network 216, such as
the remote server computer 218 depicted in FIG. 2. The computer
network 216 may be a local area network 51 (FIG. 1), a wide area
network 52, or a combination of networks that allow the server
computer 204 to communicate with remote computers, such as the
remote server computer 218, either directly or indirectly. The
server computer 204 and the remote server computer 218 are
preferably similar to the personal computer 20 depicted in FIG. 1
and discussed above.

The Web crawler program 206 searches remote server computers 218
connected to the network 216 for electronic documents 222 and 224.
The Web crawler 206 retrieves electronic documents and associated
data. The contents of the electronic documents 222 and 224, along
with the associated data, can be used in a variety of ways. For
example, the Web crawler 206 may pass the information to an indexing
engine 208. An indexing engine 208 is a computer program that
maintains an index 210 of electronic documents. The index 210 is
similar to the index in a book, and contains reference information
and pointers to corresponding electronic documents to which the
reference information applies. For example, the index may include
keywords, and for each keyword a list of addresses. Each address can
be used to locate a document that includes the keyword. The index
may also include information other than keywords used within the
electronic documents. For example, the index 210 may include subject
headings or category names, even when the literal subject heading or
category name is not included within the electronic document. The
type of information stored in the index depends upon the complexity
of the indexing engine, which may analyze the contents of the
electronic document and store the results of the analysis.

A client computer 214, such as the personal computer 20 (FIG. 1), is
connected to the server computer 204 by a computer network 212. The
computer network 212 may be a local area network, a wide area
network, or a combination of networks. The computer network 212 may
be the same network as the computer network 216 or a different
network. The client computer 214 includes a computer program, such
as a "browser" 215 that locates and displays documents to a user.

When a user at the client computer 214 desires to search for one or
more electronic documents, the client computer transmits data to a
search engine 230 requesting a search. At that time, the search
engine 230 examines its associated index 210 to find documents that
may be desired by a user. The search engine 230 may then return a
list of documents to the browser 215 at the client computer 214. The
user may then examine the list of documents and retrieve one or more
desired electronic documents from remote computers such as the
remote server computer 218.

As will be readily understood by those skilled in the art of
computer network systems, and others, the system illustrated in
FIG. 2 is exemplary, and alternative configurations may also be used
in accordance with the invention. For example, the server computer
204 itself may include electronic documents 232 and 234 that are
accessed by the Web crawler program 206. Also the Web crawler
program 206, the indexing engine 208, and the search engine 230 may
reside on different computers. Additionally, the Web browser program
and the Web crawler program 206 may reside on a single computer.
Further, the indexing engine 208 and search engine 230 are not
required by the present invention. The Web crawler program 206 may
retrieve electronic document information for usages other than
providing the information to a search engine. As discussed above,
the client computer 214, the server computer 204, and the remote
server computer 218 may communicate through any type of
communication network or communications medium.

FIG. 3 illustrates, in further detail, a Web crawler program 206 and
related software executing on the server computer 204 (FIG. 2) that
performs Web crawling and indexing of information in accordance with
the present invention. As illustrated in FIG. 3, the Web crawler
program 206 includes a "gatherer" process 304 that performs crawling
of the Web and gathering of information pertaining to electronic
documents. The gatherer process 304 is invoked by passing it one or
more starting document address specifications, i.e., URLs 306. The
starting URLs 306 serve as seeds, instructing the gatherer process
304 where to begin its Web crawling process. A starting URL can be a
universal naming convention (UNC) directory, a UNC path to a file,
or an HTTP path to a URL. The gatherer process 304 inserts the
starting URLs
306 into a transaction log 310, which maintains a list of
URLs that are currently being processed or have not yet been
processed. The transaction log 310 functions as a queue. It
is called a log because it is preferably implemented as a
persistent queue that is written and kept in a nonvolatile
storage device such as a disk to enable recovery after a
system failure. Preferably, the transaction queue maintains a
small in-memory cache for quick access to the next
transactions.

The gatherer process 304 also maintains a history map
308, which contains an ongoing list of all URLs that have
been searched during the current Web crawl and previous
crawls. The gatherer process 304 includes one or more
worker threads 312 that process each URL. The worker
thread 312 retrieves a URL from the transaction log 310 and
passes the URL to a filter daemon 314. The filter daemon
314 is a process that uses the URL to retrieve the electronic
document at the address specified by the URL. The filter
daemon 314 uses the access method specified by the URL to
retrieve the electronic document. For example, if the access
method is HTTP, the filter daemon 314 uses HTTP commands
to retrieve the document. If the access method
specified is FILE, the filter daemon uses file system commands
to retrieve the corresponding documents. The File
Transfer Protocol (FTP) is another well-known access
method that the filter daemon may use to retrieve a document.
Other access protocols may also be used in conjunction
with the invention.

After retrieving an electronic document, the filter daemon
parses the electronic document and returns a list of text and
properties. An HTML document includes a sequence of
properties or "tags," each tag containing some information.
The information may be text that is to be displayed in the
Web browser program 215 (FIG. 2). The information may
also be "metadata" that describes the formatting of text. The
information within tags may also contain hyperlinks to other
electronic documents. A hyperlink includes a specification
of a Web address. If the tag containing a hyperlink is an
image, the Web browser program 215 uses the hyperlink to
retrieve the image and render it on the Web page. Similarly,
the hyperlink may specify the address of audio data. If a
hyperlink points to audio data, the Web browser program
retrieves the audio data and plays it.

An "anchor" tag specifies a visual element and a hyperlink.
The visual element may be text or a hyperlink to an
image. When a user selects an anchor having an associated
hyperlink in a Web browser program 215, the Web browser
program automatically retrieves an electronic document at
the address specified in the hyperlink.

Tags may also contain information intended for a search
engine. For example, a tag may include a subject or category
within which the electronic document falls, to assist search
engines that perform searches by subject or category. The
information contained in tags is referred to as "properties" of
the electronic document. An electronic document is therefore
considered to be made up of a set of properties and text. The
filter daemon 314 returns the list of properties and text
within an electronic document to the worker thread 312.

As discussed above, an electronic document may contain
one or more hyperlinks. Therefore, the list of properties
includes a list of URLs that are included in hyperlinks within
the electronic document. The worker thread 312 passes this
list of URLs to the history map 308. When a new or modified
electronic document is retrieved, the history map 308 checks
each hyperlink URL to determine if it is already listed within
the history map. URLs that are not already listed are added
to the history map and are marked as not having been
crawled during the current crawl. They are also added to the
transaction log 310, to be subsequently processed by a
worker thread. As discussed below, a history map includes
crawl number crawled data and crawl number modified data.
The crawl number crawled data indicates the most recent
crawl number during which the URL was processed. The
crawl number modified data indicates the most recent crawl
number during which a modified electronic document was
retrieved. Use of the history map 308 works to prevent the
same URL from being processed more than once during a
crawl.

The worker thread 312 then passes the list of properties
and text to the indexing engine 208. The indexing engine
208 creates an index 210, which is used by the search engine
230 in subsequent searches.

FIG. 4 illustrates an exemplary history map 308 in accordance
with the present invention. Preferably, the history map
308 is stored in a nonvolatile memory so that it is persistent
across multiple crawls and system shutdowns. As depicted,
the history map 308 includes multiple entries 410, one entry
corresponding to each URL 412. Each history map entry
410 includes URL data 412 specifying the addresses of
electronic documents and time stamp data 414 corresponding
to each electronic document. The time stamp data 414
specifies the time stamp of the corresponding electronic
document at the most recent time that the Web crawler
retrieved the electronic document. A history map entry 410
also includes hash value data 416 and crawl number crawled
data 418. The crawl number crawled data 418 specifies the
most recent crawl during which the corresponding URL was
processed. As discussed below, the crawl number crawled
data 418 prevents duplicate processing of URLs during a
crawl, and allows a crawl to be completed. When a crawl is
completed, the crawl number crawled data 418 corresponding
to each entry in the history map 308 is equal to the
current crawl number, unless the crawler did not find a link
to the corresponding document.

The history map 308 also includes crawl number modified
data 420. The crawl number modified data 420 specifies the
most recent crawl number during which the corresponding
electronic document was determined to be modified. In
contrast, the crawl number crawled data 418 specifies the
most recent crawl number in which the document was
processed. The crawl number crawled data 418 is set to the
current crawl number each time the document is processed.
The crawl number modified 420 is only set to the current
crawl number when the document is found to have changed.
The use of crawl numbers is explained in further detail
below.

As noted above, the history map entry 410 also includes
hash value data 416. The hash value data 416 specifies a
hash value corresponding to the electronic document specified
by the URL 412. A hash value results from applying a
"hash function" to the electronic document. A hash function
is a mathematical algorithm that transforms a digital document
into a smaller representation of the document. The
smaller representation of the document is the hash value
corresponding to the document. A "secure hash function" is
a hash function that is designed so that it is computationally
unfeasible to find two different documents that "hash" to
produce identical hash values. A hash value produced by a
secure hash function serves as a "digital fingerprint" of the
document. If two separately produced hash values are
equivalent, one can be certain to a very high degree of
probability that the documents used to produce the respective
hash values are exactly the same. Similarly, if two
hash values are not the same, the corresponding documents
are not exactly the same. As discussed in further detail 
below, the mechanism of the invention saves a hash value 
corresponding to an electronic document, and compares the 
stored hash value with a new hash value computed from a 
newly retrieved document, in order to determine whether the 
documents are equivalent, and therefore whether the elec- 
tronic document has changed. In one actual embodiment of 
the invention, a secure hash function known as "MD5" is 
used to create hash values. The MD5 secure hash function is 
published by RSA Laboratories of Redwood City, Calif., in 
a document entitled RFC 1321. 
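
The hash comparison described above can be sketched as follows. This is a minimal illustration, assuming the filtered document data is available as a text string; the function names `content_hash` and `has_changed` are hypothetical and do not appear in the specification.

```python
import hashlib


def content_hash(filtered_text: str) -> str:
    """Return the MD5 digest (hex string) of the filtered document data."""
    return hashlib.md5(filtered_text.encode("utf-8")).hexdigest()


def has_changed(stored_hash: str, filtered_text: str) -> bool:
    """Compare the stored hash value with one computed from new data."""
    return content_hash(filtered_text) != stored_hash


# Hash saved when the document was previously retrieved.
stored = content_hash("Hello, Web")
```

Calling `has_changed(stored, "Hello, Web")` reports no change, while any difference in the filtered data, however small, yields a different digest and is reported as a change.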

An exemplary transaction log 310 is shown in FIG. 5. The 
transaction log 310 contains a plurality of entries 510 that 
each represent a document to visit during the Web crawl. In
an actual embodiment of the invention, each entry 510 in the 
transaction log contains the URL of the document to be 
processed, a status data 514 that is marked when the entry 
510 is processed, an error code data 516 that indicates any 
errors encountered during processing, a user name data 518 
and an encoded password data 520. The user name data 518
and the encoded password data 520 can be used during 
processing to access secure Web sites. One skilled in the art 
will appreciate that additional fields can be added to the data 
entries 410 and 510, as may be required by the particular 
application of the invention. 
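
One way to picture the two record types is as plain data classes. The sketch below is an assumption for illustration only: the field names, types, and zero or empty defaults are hypothetical, and the comments map each field to the reference numeral it stands for.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HistoryMapEntry:              # entry 410
    url: str                        # URL data 412
    time_stamp: float = 0.0         # time stamp data 414
    hash_value: str = ""            # hash value data 416
    crawl_number_crawled: int = 0   # crawl number crawled data 418
    crawl_number_modified: int = 0  # crawl number modified data 420


@dataclass
class TransactionLogEntry:          # entry 510
    url: str                        # URL 512
    completed: bool = False         # status data 514
    error_code: int = 0             # error code data 516
    user_name: Optional[str] = None         # user name data 518
    encoded_password: Optional[str] = None  # encoded password data 520


seed = TransactionLogEntry(url="http://example.com/")
record = HistoryMapEntry(url=seed.url)
```

The zero and empty defaults mirror the "empty" or "null" initial values that the description assigns to new history map entries.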

The broad "types" of Web crawls performed by the 
present invention can be conveniently described as a "first 
full crawl" (FIG. 6), which creates and fills both an instance of
the index 210 and an instance of the history map 308, a "full
crawl" (FIG. 10) that fills a new instance of the index 210
while using the existing history map 308, and an "incre- 
mental crawl" (FIG. 11) that updates the existing index 210 
as it revisits the URLs contained in the existing history map 
308 and checks for changes to the documents. Once initial- 
ized as a first full crawl, a full crawl, or an incremental 
crawl, the method and system of the Web crawl described in 
FIGS. 7-9 is essentially the same for all types of Web crawls 
performed by the invention. 

The first full crawl 610 is shown in FIG. 6. In a step 612,
the gatherer 304 creates a new transaction log 310 and a new
history map 308, neither of which have any preexisting
entries 410 or 510. The transaction log is then loaded with
one or more entries 510 containing "seed" URLs 512 in a
step 614. The inserted URLs 512 are referred to as "seeds"
because they act as starting points for the Web crawl. During
the Web crawl, the Web crawler 206 will recursively gather
and visit the URLs that are referenced in documents that the
Web crawler 206 gathers (FIG. 9). As is known to those
skilled in the art, web crawls may also be limited to specific
crawl parameters that define, for instance, the type of web
documents to be crawled.

In a step 616, corresponding entries 410 are made in the
history map 308 for each of the seed entries 510 made in the
transaction log 310. The history map entries 410 are initialized
so that the time stamp data 414, the hash value data 416,
the crawl number crawled data 418 and the crawl number
modified field 420 are all set equal to zero or an equivalent
"empty" or "null" value. These initialized values 414, 416,
418 and 420 will influence the way in which the entry 510
will be processed in the Web crawl in FIGS. 7-9. A new
index 210 is created in a step 618 and the Web crawl is
performed in a step 620.

The Web crawl performed in the step 620 is illustrated in
FIG. 7. At a step 706, the Web crawler 206 begins a loop of
retrieving and processing URLs from the transaction log
310. Specifically, at the step 706, a worker thread 312
retrieves a URL 512 from an unprocessed entry 510 in the
transaction log 310 and passes the URL to the processing
illustrated in FIGS. 8a and 8b at a step 708. Each entry 510
in the transaction log 310 is processed in this manner until
it is detected in a decision step 712 that all the entries 510
in the transaction log 310 have been processed.

Although the process 620 is discussed herein with reference
to a single worker thread 312, preferably the mechanism
of the invention includes multiple worker threads 312,
each worker thread, in conjunction with other components,
performing the Web crawl illustrated in FIGS. 7-9. Each
worker thread retrieves a URL from the transaction log (step
706), processes the URL as described above (step 708), and
then continues to retrieve and process URLs until there are
none left in the transaction log (step 712). The number of
worker threads may depend upon the configuration and
available resources of the computer system.

FIGS. 8a and 8b illustrate the step 708 of processing a
URL retrieved from the transaction log 310 during a first full
crawl. FIGS. 8a and 8b will also be discussed below with
reference to the full crawl (FIG. 10) and the incremental
crawl (FIG. 11). At a step 802, a determination is made of
whether the URL 512 has been processed during the current
crawl by checking the history map crawl number crawled
data 418 for the current crawl number. If the crawl number
crawled 418 corresponding to the URL matches the current
crawl number, the URL has been processed during the
current crawl. If the crawl number crawled 418 does not
match the current crawl number, or if the history map 308
does not contain an entry for the URL, the URL has not been
processed during the current crawl. If the URL has been
crawled during the current crawl, the process 708 is complete
for the URL.

Documents 222, 224 may be retrieved under the system
and method of the present invention either conditionally or
unconditionally. A first full crawl and a full crawl retrieve
the documents 222, 224 unconditionally, while an incremental
crawl retrieves documents conditionally based on a comparison
of time stamps. This is discussed in detail below
with reference to FIG. 11. In the first full crawl, the decision
step 803 passes control to a step 806 where the URL is
unconditionally retrieved. Decision step 803 is illustrated for
the convenience of showing that the documents in the
transaction log 310 are unconditionally retrieved during the
first full crawl. The document is unconditionally retrieved in
the step 806 when the decision step 804 determines that the
URL 512 has not been previously retrieved (e.g., because
there is no entry 410 for the URL in the history map 308 or
the entry in the history map 308 has a zero value in the crawl
number crawled data 418). If the retrieval of the document
is successful in the step 808, the decision block 810 passes
control to a step 812 (FIG. 8b).

At step 812 (FIG. 8b), the filter daemon 314 filters the
new electronic document. The worker thread 312 then
calculates a hash value from the filtered data received from
the filter daemon 314 at a step 814. As discussed above, the
worker thread preferably uses a secure hash function, such
as MD5, to calculate the hash value. At a step 820 (FIG. 8b),
the hash value 416 of the previously retrieved corresponding
electronic document 410 is compared with the new hash
value calculated at the step 814 and a determination is made
of whether the hash values are equal. Equal hash values
indicate that the filtered data corresponding to the newly
retrieved electronic document is the same as the filtered data
corresponding to the previously retrieved version of the
electronic document. Because this is a first full crawl and the
hash value was initialized to zero, the decision block 820
will determine that the hash values are not equal. Since the
hash values are not equal, the crawl number modified 420 in
the history map 308 is set to be the current crawl number in
a step 826. The document information is then stored in the
index 210 in step 822. The entry 410 in the history map 308
associated with the URL 412 is updated in a step 824 with
the new hash value 416 (calculated in step 814), the document
time stamp 414 (retrieved with the document), and the
crawl number modified 420 (set in the step 826).

At a step 828, the URLs that are included as hyperlinks in
the newly retrieved electronic document are processed. The
processing of the linked URLs at the step 828 is illustrated
in FIG. 9 and discussed below. In a block 830, the status data
514 for the entry 510 being processed is marked as completed.
Besides being used in the step 712 to determine if all
the entries 510 have been processed, marking the entries 510
as they are completed assists in a recovery from a system
failure by allowing the crawler to continue the crawl from
where it left off. This is possible because the transaction log
310 is persistently stored on a storage medium (e.g., 27, 28
or 30) and updated as the crawl proceeds. After step 830, the
processing of the URL is finished, and control returns to
decision block 712 in FIG. 7, where the processing of the
next URL begins.

FIG. 9 illustrates the processing of the linked URLs (step
722) contained within an electronic document 222, 224. At
a step 902 a linked URL is retrieved from the filtered data
passed back from the filter daemon 314. At a step 904 a
determination is made of whether the history map 308
contains the linked URL. If the history map does not contain
the linked URL, at a step 906, the linked URL is added to
the history map 308 and the entry 410 is initialized as
discussed above. The linked URL is also added to the
transaction log 310 at a step 908.

If, at the step 904, it is determined that the history map
308 contains the linked URL, then at a step 910, a determination
is made of whether the crawl number crawled in
the history map 308 associated with that URL is set to the
current crawl number. A negative determination indicates
that the linked URL has not yet been processed during the
current crawl and the crawl number crawled is set to the
current crawl number in a step 912 and the URL is added to
the transaction log 310 in the step 908. If the crawl number
crawled 418 is equal to the current crawl number, the URL
has already been added to the transaction log 310 and the
step 908 is skipped and the processing proceeds to step 914.
This prevents the same URL from being added to the
transaction log 310 more than once during the same Web
crawl.

Processing continues in step 914 after step 908 or step
910. At step 914, a determination is made of whether there
are any additional linked URLs in the filtered data. If any
additional linked URLs exist, processing returns to step 902,
to begin the retrieval and processing of the next linked URL.
If, at step 914, there are no more linked URLs to process, the
processing 722 of the linked URLs within the filtered data
corresponding to an electronic document is complete.

A "full crawl" 1010 is illustrated in FIG. 10. The full
crawl begins at a step 1012 by inserting one or more seed
URLs 512 into entries 510 in the transaction log 310. The
full crawl creates a new index 210 each time it runs at a step
1014. Unlike the first full crawl (FIG. 6), which is discussed
above, the full crawl 1010 opens an existing history map 308
in a step 1016 which it uses during the processing of the
entries in the transaction log 310 (FIGS. 8a, 8b, and 9). In
a step 1018, the Web crawl is performed in substantially the
same manner as that illustrated in FIG. 7, which is discussed
above.

Returning to FIGS. 8a and 8b to illustrate the processing
of the URLs (step 708) during the full web crawl, the steps
illustrated in FIG. 8a and steps 812 and 814 shown in FIG.
8b are performed during the full crawl in substantially the
same manner as is discussed above for the first full crawl
(FIG. 6). An unsuccessful retrieval of a document is
detected in the decision block 810. At a step 816 a determination
is made of whether the document still exists. If the
document no longer exists, at a step 818, entries 410
pertaining to the document are deleted from the history map
308 and the index 210. The entry 510 is then marked as
completed in a block 830. An error code 516 can also be
inserted into the error code field 516.

If a determination cannot be made at the step 816, the
process 708 is complete for the URL and the entry 510 is not
marked as complete. This may occur, for example, if communication
with a server cannot be established. The processing
of this URL is complete after the step 819.
Because the entry 510 for this URL is not marked as
complete, the URL may be retrieved by the worker thread
312 at a later time in the processing of the URLs in the
transaction log. The number of retrieval attempts for any
given URL can be limited to a predetermined number. After
this predetermined number is reached, the entry 510 is
marked as complete and an error code is inserted in the error
code data 516. The processing described in steps 810-816
and [...] is performed in substantially the same manner for
all types of Web crawls performed by the invention.

If the document is successfully retrieved, the document is
retrieved and filtered in the step 812 and the hash value of
the document is computed in the step 814. At a step 820, the
hash value 416 computed the last time the document was
retrieved is compared with the new hash value calculated at
the step 814 and a determination is made of whether the hash
values are equal. Equal hash values indicate that the filtered
data corresponding to the newly retrieved electronic document
is the same as the filtered data corresponding to the
previously retrieved version of the electronic document. In
one actual embodiment of the invention, at a step 824, even
if the hash values are equal, data from the electronic
document is indexed along with the newly computed hash
value and document time stamp. If the document was
unchanged, as indicated by the new hash value (from step
814) being the same as the hash value 416 stored in the
history map 308, the previous value of the crawl number
modified 420 (stored in the history map 308) is added to the
index, along with the filtered data, hash value, and document
time stamp at a step 822. The electronic document may
therefore have a time stamp that is more recent than its crawl
number modified, for example, if the time stamp has
changed but the filtered data is unchanged.

A determination at step 820 that the new hash value does
not equal the old hash value 416 indicates that the filtered data
corresponding to the newly retrieved electronic document is
different from the filtered data corresponding to the previously
retrieved version of the electronic document. If the hash
values are not equal, the crawl number modified 420 in the
history map 308 is set to be the current crawl number in a
step 826. This change made to the crawl number modified
420 indicates that the document was found to have changed
in a substantive way. The document information is then
stored in the index 210 in step 822, as described above, but
the crawl number modified that is stored with the data has
been set to the current crawl number in step 826. The entry
410 associated with the URL in history map 308 is updated
in a step 824 with the new hash value 416, the document
time stamp 414, and the crawl number modified 420. At a
step 828, the URLs that are included as hyperlinks in the
newly retrieved electronic document are processed as discussed
above with reference to FIG. 9. The entry 510 is then
marked as processed in the block 830. 
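
The update made to a history map entry after a document is retrieved (steps 820, 824 and 826) can be sketched as follows. This is a simplified illustration, assuming a history map entry is held as a dictionary; the function name `record_retrieval` and the key names are hypothetical.

```python
def record_retrieval(entry, new_hash, time_stamp, current_crawl):
    """Update a history map entry after a document has been retrieved.

    Returns True when the filtered data was found to have changed.
    """
    # Step 820: compare the stored hash value with the new one.
    changed = new_hash != entry["hash_value"]
    if changed:
        # Step 826: only a substantive change advances this number.
        entry["crawl_number_modified"] = current_crawl
    # Step 824: the hash value and time stamp are always refreshed.
    entry["hash_value"] = new_hash
    entry["time_stamp"] = time_stamp
    return changed


entry = {"hash_value": "", "time_stamp": 0, "crawl_number_modified": 0}
first = record_retrieval(entry, "abc", 100, current_crawl=1)
second = record_retrieval(entry, "abc", 200, current_crawl=2)
```

After the second call the time stamp has advanced to 200 while the crawl number modified remains 1, which is the case the description mentions of a time stamp that is more recent than the crawl number modified.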

FIG. 11 illustrates a process 1110 of performing an
"incremental crawl" in accordance with the present invention.
An incremental crawl is performed subsequent to a full
crawl or an incremental crawl, for the purpose of retrieving
new documents or documents that have been modified since
the previous crawl. The incremental crawl uses an existing
history map 308 that is opened in a step 1112. At a step 1114,
the transaction log 310 is "seeded" by copying all of the 
history map entries 410 into the transaction log as entries 
510. The history map 308 is updated during the crawl and 
saved between crawls. Therefore, the history map 308 
contains the URLs corresponding to all electronic docu- 
ments retrieved in the previous full and incremental crawls 
except for those deleted as described above. Copying the 
entire history map 308 to the transaction log 310 is a way of 
instructing the worker threads 312 to process URLs corre- 
sponding to electronic documents that have previously been 
retrieved. 
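
Seeding the transaction log from the history map (step 1114) might look like the sketch below. It assumes the history map is a dictionary keyed by URL and the transaction log is an in-memory queue; both representations, and the name `seed_incremental_crawl`, are illustrative assumptions rather than the persistent structures the description calls for.

```python
from collections import deque


def seed_incremental_crawl(history_map):
    """Copy every URL in the history map into a fresh transaction log."""
    return deque({"url": url, "completed": False} for url in history_map)


history_map = {
    "http://a.example/": {"crawl_number_modified": 1},
    "http://b.example/": {"crawl_number_modified": 2},
}
transaction_log = seed_incremental_crawl(history_map)
```

Every previously retrieved URL becomes an unprocessed entry, so the worker threads revisit each known document during the incremental crawl.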

A purpose of the incremental crawl 1110 is to update the 
existing index 210, which is opened for update in a step 1116.
The Web crawl illustrated in FIG. 7 is then performed in
substantially the same manner as is described above. Referring
to FIGS. 8a and 8b to discuss the processing of the
URLs (step 708) in an incremental crawl, the incremental
crawl differs from the first full crawl and the full crawl in
that it can conditionally retrieve documents based on time
stamps as illustrated in steps 803, 804 and 808 of FIG. 8a. The
remaining steps shown in FIGS. 8a and 8b are performed in
substantially the same manner as is discussed above with 
reference to the first full crawl (FIG. 6) and the full crawl 
(FIG. 10). Since this is an incremental crawl 1110, the 
decision block 803 directs the program control to a step 804. 
At a step 804, a determination is made of whether the 
electronic document corresponding to the URL has been 
retrieved prior to the current crawl. If the history map 308 
does not contain an entry for the URL, the corresponding 
electronic document has not been retrieved prior to the 
current crawl and is unconditionally retrieved in the step 
806. 

If at step 804 it is determined that the electronic document 
corresponding to the URL 512 has been retrieved prior to the
current crawl, at a step 808 the worker thread passes the 
URL 512 and its associated time stamp 414 to the filter 
daemon 314, which conditionally retrieves the electronic 
document corresponding to the URL 412. In particular, the 
retrieval of the electronic document is conditional upon an 
indication that the electronic document has been modified, 
based upon a saved time stamp 414 of the electronic 
document. As discussed above, the history map 308 is 
persistent across crawls and system shutdowns. A history 
map entry 410 (FIG. 4) includes a time stamp 414 of the 
electronic document. When an electronic document is 
retrieved using the HTTP protocol, the Web server passes 
the electronic document with a time stamp that indicates the 
most recent time at which the electronic document has been 
modified. When a history map entry 410 is created, the time 
stamp is stored in the entry 414. 

In one actual embodiment of the invention, at step 808, 
when the electronic document is retrieved using the HTTP 
protocol, an HTTP "Get If-Modified-Since" command is
sent from the Web crawler 206 to the Web server addressed
by the URL. This command includes a specification of a 
time stamp. The Web server receiving this command com- 
pares the received time stamp with the time stamp of the 
corresponding electronic document on the Web server. The 
Web server transmits the corresponding electronic document 
to the Web crawler only if a comparison of the time stamps 
indicates that the electronic document has been updated 
since the date and time specified by the received time stamp. 
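
In HTTP terms, the conditional retrieval is a GET request carrying an If-Modified-Since header, to which a server may answer with status 304 (Not Modified) instead of the document. The sketch below uses Python's standard library and assumes the stored time stamp is kept as seconds since the epoch; the function names are hypothetical.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def build_conditional_request(url, stored_time_stamp):
    """Build a GET request that carries the stored time stamp."""
    request = urllib.request.Request(url)
    # HTTP dates are expressed in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    request.add_header("If-Modified-Since",
                       formatdate(stored_time_stamp, usegmt=True))
    return request


def fetch_if_modified(url, stored_time_stamp):
    """Return the document body, or None if the server reports no change."""
    request = build_conditional_request(url, stored_time_stamp)
    try:
        with urllib.request.urlopen(request) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: nothing to retrieve
            return None
        raise


request = build_conditional_request("http://example.com/", 0)
```

A server that ignores the header simply returns the document, which is why the hash comparison is still needed after a conditional retrieval.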

Similarly, when the FILE protocol is used to retrieve an 
electronic document, file system attributes are used to deter- 
mine whether the electronic document has a file date more 
recent than the time stamp stored in the history map. A 
similar determination is made when other protocols are used 
to retrieve an electronic document. 
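
For the FILE protocol, the comparison can be made against the file's modification time. A minimal sketch, assuming the stored time stamp is in seconds since the epoch and that the file date of interest is the last-modified time; the name `file_modified_since` is hypothetical.

```python
import os


def file_modified_since(path, stored_time_stamp):
    """Return True if the file's modification time is newer than the stored one."""
    return os.path.getmtime(path) > stored_time_stamp
```

A file whose modification time equals the stored time stamp is treated as unchanged and is not retrieved again.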

At a step 810, the worker thread 312 determines whether 
an electronic document is received at the step 808. If a 
document is not retrieved, at a step 816 a determination is 
made of whether the document still exists. If the document 
no longer exists, at a step 818, entries pertaining to the 
document are deleted from the index 210 and the history 
map 308. If the document still exists, but has not been 
retrieved because the time stamp is unchanged, this is 
detected in decision step 819 and the entry 510 is marked as 
complete in the step 830. If a determination cannot be made 
at the step 816, the process 708 is complete for the URL. 
This may occur, for example, if communication with a 
server cannot be established. Because the entry 510 for this 
URL is not marked as complete, the worker thread can 
attempt to retrieve the URL again later, subject to the 
predefined limits discussed above. 
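
The branching in steps 810 through 819 can be summarized in a short sketch. The `Outcome` enumeration, the function name, and the dictionary-based history map, index and log entry are all assumptions made for illustration; marking a deleted document's entry as complete follows the description given for the full crawl.

```python
from enum import Enum


class Outcome(Enum):
    RETRIEVED = "retrieved"  # a document was received (step 810)
    UNCHANGED = "unchanged"  # exists, time stamp unchanged (step 819)
    DELETED = "deleted"      # document no longer exists (step 818)
    UNKNOWN = "unknown"      # e.g. the server could not be reached


def handle_outcome(outcome, url, history_map, index, log_entry):
    """Return True if processing should continue at step 812."""
    if outcome is Outcome.RETRIEVED:
        return True
    if outcome is Outcome.DELETED:
        history_map.pop(url, None)  # step 818
        index.pop(url, None)
        log_entry["completed"] = True
    elif outcome is Outcome.UNCHANGED:
        log_entry["completed"] = True  # step 830
    # UNKNOWN: the entry stays incomplete so it can be retried later.
    return False
```

Leaving the entry incomplete in the unknown case is what allows a worker thread to attempt the URL again, up to the predetermined limit.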

A determination at step 810 that a new document has been 
retrieved indicates that the new electronic document has a 
time stamp more recent than the stored time stamp of the
previous version of the electronic document. Some Web 
servers do not support the HTTP "Get If-Modified-Since" 
command, and always transfer an electronic document in 
response to this command. Therefore, receiving a new 
electronic document at step 808 and determining that a new 
electronic document is retrieved at step 810 does not guar- 
antee that the retrieved document has a more recent time 
stamp. However, processing continues at a step 812, under 
the assumption that the newly retrieved electronic document 
may have a more recent time stamp. 

The remaining steps illustrated in FIGS. 8a, 8b, and 9 are
performed during the incremental crawl 1110 in substan- 
tially the same manner as is discussed above with reference 
to the first full crawl (FIG. 6) and the full crawl (FIG. 10). 

FIG. 12 illustrates an exemplary process 1202 of handling 
a Web search request in accordance with the present inven- 
tion. At a step 1204, a search engine 230 (FIG. 2) receives 
a search request from a client application such as the web 
browser 215. If the user wishes to receive only those 
documents that have changed in some substantive way since 
the last time the search request was run, the Web browser 
215 (or other server or client application) sending the search 
request implicitly adds a clause to the search request that 
limits the search to only return those documents that have a 
crawl number modified that is greater than a stored crawl 
number associated with the last time the search request was 
processed by the search engine 230 (step 1205). The stored 
crawl number is retained in a search request history 250 
(FIG. 2) and represents the crawl number of the most recent 
crawl that preceded the last time that the search request was 
processed. 

At a step 1206, the search engine 230 searches the index 
210 for entries matching the specified criteria. The search 
engine 230 returns to the client computer 214 search results
that include zero, one, or more "hits" at a step 1208. Each 
hit corresponds to an electronic document that matches the 
search criteria. A "match" includes having a crawl number
modified that is more recent than the stored crawl number 
specified in the search request. After the search is performed, 
at a step 1210, the client application 215 implicitly asks the 
search engine 230 to return the crawl number of the most 
recently performed crawl, which it then stores with the 
search request in a search request history. 
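
The search-time use of crawl numbers can be sketched as a filter over the index. This is an illustration under stated assumptions: the index is a dictionary keyed by URL, matching is reduced to a substring test, and the function name `changed_since` is hypothetical.

```python
def changed_since(index, query, stored_crawl_number):
    """Return URLs that match the query and changed after the stored crawl."""
    return [
        url
        for url, doc in index.items()
        if query in doc["text"]
        and doc["crawl_number_modified"] > stored_crawl_number
    ]


index = {
    "http://a.example/": {"text": "web crawler notes", "crawl_number_modified": 3},
    "http://b.example/": {"text": "web crawler design", "crawl_number_modified": 5},
    "http://c.example/": {"text": "cooking", "crawl_number_modified": 6},
}
hits = changed_since(index, "crawler", stored_crawl_number=4)
```

With a stored crawl number of 4, only the second document is returned: the first matches the query but has not changed since crawl 3, and the third changed recently but does not match. The client would then store the latest crawl number with the search request for use next time.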

While the preferred embodiment of the invention has been 
illustrated and described, it will be appreciated that various 
changes can be made therein without departing from the 
spirit and scope of the invention as defined by the appended 
claims. 

The embodiments of the invention in which an exclusive 
property or privilege is claimed are defined as follows:

1. A computer based method of retrieving information 
from a computer network (Web) having a plurality of 
electronic documents stored thereon, wherein each elec- 
tronic document has a corresponding document address 
specification that provides information for locating the elec- 
tronic document, the method including performing a current 
Web crawl comprising: 

assigning a current crawl number to the current Web 
crawl, said current crawl number being the next num- 
ber in a numerical sequence of numbers; 

determining whether an electronic document has been 
retrieved during a previous Web crawl and associated 
with a crawl number modified; 

if the electronic document has not been retrieved during 
a previous Web crawl and associated with a crawl 
number modified, associating the current crawl number 
with the electronic document as its crawl number 
modified; 

if the electronic document has been retrieved during a 
previous Web crawl and associated with a crawl num- 
ber modified, determining whether the actual content of 
the electronic document has been modified subsequent 
to the previous retrieval; and 

if the actual content of the electronic document has been 
modified subsequent to the previous retrieval, associ- 
ating the current crawl number with the electronic 
document as its crawl number modified. 

2. The method of claim 1, wherein determining whether 
the actual content of the electronic document has been 
modified subsequent to the previous retrieval comprises: 

determining whether the electronic document has an 
associated time stamp matching a previously stored 
time stamp associated with the electronic document; 

if the electronic document does not have an associated 
time stamp matching the previously stored time stamp, 
retrieving the electronic document by using a document 
address specification; and 

if the electronic document has an associated time stamp 
matching the previously stored time stamp, not retriev- 
ing the electronic document. 

3. The method of claim 1, wherein determining whether 
the actual content of the electronic document has been 
modified subsequent to the previous retrieval comprises: 

determining current representation data corresponding to 
the electronic document; and 

comparing the current representation data corresponding 
to the electronic document with previous representation 
data corresponding to the electronic document and 
determined prior to performing the current Web crawl. 

4. The method of claim 3, wherein the current represen- 
tation data is a hash value and determining the representa- 
tion data comprises performing a hash function. 

5. The method of claim 4, wherein the hash function is a 
secure hash function. 

6. The method of claim 3, wherein determining represen- 
tation data comprises: 

filtering out selected data from the electronic document; 
and 

determining representation data representative of data 
from the electronic document that has not been filtered 
out. 

7. The method of claim 6, wherein filtering out selected 
data includes filtering out text format specification data. 

8. The method of claim 1, wherein determining whether 
the actual content of the electronic document has been 
modified subsequent to the previous retrieval comprises: 

(a) determining whether the electronic document has an 
associated time stamp matching a previously stored 
time stamp associated with the electronic document; 

(b) if the electronic document does not have an associated 
time stamp matching the previously stored time stamp, 
performing a document comparison by: 

(i) retrieving the electronic document; 

(ii) determining current representation data correspond- 
ing to the electronic document; and 

(iii) comparing the current representation data corre- 
sponding to the electronic document with previous 
representation data corresponding to the electronic 
document and determined prior to the current Web 
crawl. 

9. The method of claim 2, wherein determining whether 
the actual content of the electronic document has been 
modified subsequent to the previous retrieval further com- 
prises: 

sending a request to a server to transfer the electronic 
document, wherein the transfer is based on whether the 
time stamp associated with the electronic document is 
more recent than a time stamp included in the request; 
and 

in the event that the server does not transfer the electronic 
document, determining that the electronic document 
has not been modified. 

10. The method of claim 1, further comprising: 

receiving a request to retrieve a list of electronic docu- 
ments that match a query, wherein the query includes a 
criteria to match electronic documents that have been 
modified subsequent to performing a previous Web 
crawl; and 

in response to receiving the request to retrieve a list of 
electronic documents, retrieving a set of document 
address specifications corresponding to electronic 
documents having an associated crawl number modi- 
fied assigned to the current Web crawl. 

11. The method of claim 1, further comprising: 

receiving a request to retrieve a list of electronic docu- 
ments that have been modified subsequent to perform- 
ing a previous Web crawl; and 

in response to receiving the request to retrieve a list of 
electronic documents, retrieving a set of document 
address specifications corresponding to electronic 
documents having an associated crawl number modi- 
fied assigned to a crawl more recent than said previous 
Web crawl. 

12. The method of claim 1, further comprising: 

receiving, prior to performing the current Web crawl, a 
first request to retrieve a list of electronic documents 
that match a specified criteria; 

in response to receiving the first request, providing a list 
of electronic documents that match the specified crite- 
ria; 

receiving, after the current Web crawl, a second request to 
retrieve a list of electronic documents that match the 
specified criteria; 

in response to receiving the second request, retrieving a 
second list of electronic documents that were modified 
after the current Web crawl and that match the first 
specified criteria; and 

providing the second list of electronic documents. 

13. The method of claim 1, wherein performing the 
current Web crawl further comprises: 

(a) determining at least one hyperlink contained within 
the electronic document, each hyperlink including a 
hyperlink document address specification; 

(b) determining whether each hyperlink document 
address specification included corresponds to an elec- 
tronic document retrieved prior to the current Web 
crawl; 

(c) in the event that the hyperlink document address 
specification corresponds to a linked electronic docu- 
ment retrieved prior to the current Web crawl, process- 
ing the hyperlink document address specification, said 
processing comprising: 

(i) determining whether the actual content of the linked 
electronic document has been modified subsequent 
to the prior retrieval of the electronic document; and 

(ii) in the event that the actual content of the linked 
electronic document has been modified, storing data 
from the linked electronic document and associating 
the current crawl number to the linked electronic 
document. 

14. A computer based method of retrieving information 
from a computer network (Web) having a plurality of 
electronic documents stored thereon, wherein each elec- 
tronic document has a corresponding document address 
specification that provides information for locating the elec- 
tronic document, the method comprising: 

(a) performing a Web crawl, wherein performing the Web 
crawl includes: 

(i) assigning a current crawl number to the Web crawl, 
said current crawl number establishing an order in 
which the Web crawl occurred; 

(ii) retrieving at least a portion of information con- 
tained within each of a plurality of electronic docu- 
ments that have not previously been retrieved in a 
prior Web crawl; 

(iii) retrieving at least a portion of information con- 
tained within each of a plurality of electronic docu- 
ments that have been modified subsequent to a prior 
Web crawl; and 

(iv) storing, in an index, the information retrieved from 
each of the plurality of electronic documents that 
have not been previously retrieved in a prior Web 
crawl and each of the plurality of electronic docu- 
ments that have been modified subsequent to a prior 
Web crawl and associating the information with a 
crawl number modified that corresponds to the cur- 
rent crawl number assigned to the Web crawl; and 

(b) in response to receiving, subsequent to said Web 
crawl, a request to retrieve a list of electronic docu- 
ments that have been modified subsequent to said prior 
Web crawl, selectively retrieving, from the index, said 
information corresponding to electronic documents that 
have a corresponding crawl number modified that 
exceeds the current crawl number of the said prior Web 
crawl. 

15. The method of claim 14, selectively retrieving, from 
the index, said information corresponding to electronic 
documents that have a corresponding crawl number modi- 
fied that exceeds the current crawl number of the said prior 
Web crawl, comprises retrieving portions of information 
contained within said electronic documents having an asso- 
ciated crawl number modified that exceeds the current crawl 
number of the said prior Web crawl. 

16. A computer-readable medium having computer- 
executable instructions for retrieving information from a 
computer network (Web), wherein retrieving information 
from the computer network includes performing a current 
Web crawl, wherein performing the current Web crawl 
comprises: 

assigning a current crawl number to the current Web 
crawl, said current crawl number establishing an order 
in which the Web crawl occurred; 

receiving a document address specification corresponding 
to an electronic document stored on the computer 
network; 

determining whether the electronic document has been 
retrieved during a previous Web crawl; 

if the electronic document has not been retrieved during 
a previous Web crawl, storing data from the electronic 
document and associating the data from the electronic 
document with a crawl number modified corresponding 
to the current crawl number assigned to the current Web 
crawl; 

if the electronic document has been retrieved during a 
previous Web crawl, determining whether the actual 
content of the electronic document has been modified 
subsequent to the previous Web crawl; and 

if the actual content of the electronic document has been 
modified subsequent to the previous Web crawl, storing 
data from the electronic document and associating the 
data from the electronic document with a crawl number 
modified corresponding to the current crawl number 
assigned to the current Web crawl. 

17. The computer-readable medium of claim 16, wherein 
the computer-executable instructions for determining 
whether the actual content of the electronic document has 
been modified comprises computer-executable instructions 
for: 

retrieving the electronic document; 

calculating a current hash value corresponding to the 
electronic document; 

comparing the current hash value with a previously deter- 
mined hash value corresponding to the electronic docu- 
ment; 

if the current hash value matches the previously deter- 
mined hash value, determining that the actual content 
of the electronic document is not modified; and 

if the current hash value does not match the previously 
determined hash value, determining that the actual 
content of the electronic document is modified. 

18. The computer-readable medium of claim 16, wherein 
the computer-executable instructions for determining 
whether the actual content of the electronic document has 
been modified comprises computer-executable instructions 
for: 

filtering out selected data from the electronic document; 
and 

calculating the current hash value based on data from the 
electronic document that has not been filtered out. 

19. The computer-readable medium of claim 16, having 
further computer-executable instructions for: 

receiving a request to retrieve a list of electronic docu- 
ments that have been modified subsequent to perform- 
ing a previous Web crawl; and 

in response to receiving the request to retrieve a list of 
electronic documents, retrieving a set of document 
address specifications corresponding to electronic 
documents having an associated crawl number modi- 
fied that is equal to or greater than the current crawl 
number assigned to the previous Web crawl. 

20. The computer-readable medium of claim 16, having 
further computer-executable instructions for: 

receiving a request to retrieve a list of electronic docu- 
ments that have been modified subsequent to perform- 
ing a previous Web crawl; and 

in response to receiving the request to retrieve a list of 
electronic documents, filtering out document address 
specifications corresponding to electronic documents 
having an associated crawl number modified that 
matches the current crawl number assigned to said 
previous Web crawl. 

21. A system for retrieving information stored on a 
computer network (Web), the system comprising: 

(a) a computer network (Web) including at least one 
server having a plurality of electronic documents stored 
thereon, including a first electronic document, each 
electronic document having a corresponding Web 
address; 

(b) a database containing information corresponding to 
the plurality of electronic documents, including infor- 
mation corresponding to the first electronic document; 
and 

(c) a crawler program for performing a current Web crawl, 
the crawler program comprising computer-executable 
instructions for: 

(i) assigning a current crawl number to the current Web 
crawl, the current crawl number establishing an 
order in which the Web crawl occurred; 

(ii) retrieving a Web address corresponding to the first 
electronic document; 

(iii) determining whether the first electronic document 
has information corresponding to it in the database; 

(iv) if the first electronic document does not have 
information corresponding to it in the database, 
storing information corresponding to the first elec- 
tronic document in the database, including a crawl 
number modified that corresponds to the current 
crawl number; 

(v) if the first electronic document has information 
corresponding to it in the database, determining 
whether the first electronic document is more recent 
than the database information corresponding to the 
first electronic document; and 

(vi) if the first electronic document is more recent than 
the database information corresponding to the first 
electronic document, storing information corre- 
sponding to the first electronic document in the 
database, including a crawl number modified that 
corresponds to the current crawl number. 

22. The system of claim 21, wherein the crawler program 
further comprises computer-executable instructions for: 

retrieving a previously calculated hash value correspond- 
ing to the first electronic document from the database; 

calculating a new hash value corresponding to the first 
electronic document; and 

if the new hash value is different from the previously 
calculated hash value, determining that the first elec- 
tronic document is more recent than the database 
information corresponding to the first electronic docu- 
ment. 

23. The system of claim 21, wherein the crawler program 
further comprises computer-executable instructions for fil- 
tering the first electronic document to exclude a portion of 
the data contained within the first electronic document prior 
to calculating the new hash value corresponding to the first 
electronic document. 

24. The system of claim 21, further comprising a search 
engine containing computer-executable instructions for: 

determining a set of electronic documents corresponding 
to a specified criteria, the specified criteria including a 
specification of a crawl number modified; and 

retrieving a list of electronic documents based on the 
specified criteria, including the specification of the 
crawl number modified. 

25. The method as recited in claim 14, wherein the crawl 
number is a next number in a numerical sequence of 
numbers. 

26. The computer-readable medium as recited in claim 16, 
wherein the crawl number is a next number in a numerical 
sequence of numbers. 

27. The system as recited in claim 21, wherein the crawl 
number is a next number in a numerical sequence of 
numbers. 

28. A computer based method of retrieving information 
from a computer network (Web) having a plurality of 
electronic documents stored thereon, wherein each elec- 
tronic document has a corresponding document address 
specification that provides information for locating the elec- 
tronic document, the method comprising: 

(a) performing a Web crawl, wherein performing the Web 
crawl includes: 

(i) assigning a current crawl number to the Web crawl, 
said current crawl number establishing an order in 
which the Web crawl occurred; 

(ii) retrieving at least a portion of information con- 
tained within each of a plurality of electronic docu- 
ments that have not previously been retrieved in a 
prior Web crawl; 

(iii) retrieving at least a portion of information con- 
tained within each of a plurality of electronic docu- 
ments that have been modified subsequent to a prior 
Web crawl; and 

(iv) storing, in an index, the information retrieved from 
each of the plurality of electronic documents that 
have not been previously retrieved in a prior Web 
crawl and each of the plurality of electronic docu- 
ments that have been modified subsequent to a prior 
Web crawl and associating the information with a 
crawl number modified that corresponds to the cur- 
rent crawl number assigned to the Web crawl; and 

(b) obtaining a request to retrieve a list of electronic 
documents that have been modified subsequent to an 
identified Web crawl; 

(c) associating a crawl number with the identified Web 
crawl; and 

(d) selectively retrieving, from the index, information 
corresponding to electronic documents that have a 
corresponding crawl number modified that exceeds the 
current crawl number of the crawl number associated 
with the identified Web crawl. 

29. The method as recited in claim 28, wherein assigning 
a current crawl number includes assigning a next number in 
a numerical sequence of numbers. 